feat(observability): measure response tokens and cache status#1063
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d20405b6f3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const metrics = getMetricsCollector(); | ||
| const bytes = estimateResultPayloadBytes(result); | ||
| metrics.observe('openchrome_tool_output_bytes', withTenantLabel({ tool: toolName }), bytes); | ||
| metrics.observe('openchrome_tool_estimated_tokens', withTenantLabel({ tool: toolName }), estimateOutputTokensFromChars(bytes)); |
There was a problem hiding this comment.
Compute token estimate from character count
recordToolOutputObservability passes the UTF-8 byte length into estimateOutputTokensFromChars, but that heuristic is documented and named as chars/4. For non-ASCII responses (e.g., CJK or emoji-heavy outputs), byte length is larger than character count, so openchrome_tool_estimated_tokens is systematically inflated and no longer comparable across languages.
Useful? React with 👍 / 👎.
| const status = typeof obj.status === 'string' ? obj.status : typeof obj.cacheStatus === 'string' ? obj.cacheStatus : null; | ||
| if (!status) return null; | ||
| const keyVersion = obj.keyVersion ?? obj.version ?? 'unknown'; | ||
| return { status: status.toUpperCase(), keyVersion: String(keyVersion) }; |
There was a problem hiding this comment.
Normalize cache key version metric label
extractCacheStatus forwards keyVersion as an arbitrary string, and that value is later emitted as the key_version label in openchrome_cache_status_total without any bounding/normalization. If a tool reports per-request versions (hashes/timestamps), this introduces unbounded label cardinality and can degrade Prometheus memory/query performance.
Useful? React with 👍 / 👎.
d20405b to
a6abe02
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a6abe02838
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (cache) { | ||
| metrics.inc('openchrome_cache_status_total', withTenantLabel({ | ||
| tool: toolName, | ||
| status: cache.status, |
There was a problem hiding this comment.
Constrain cache status labels before emitting metrics
recordToolOutputObservability emits openchrome_cache_status_total with status taken directly from tool output via extractCacheStatus (only uppercased), so any tool that includes request-specific detail in cache status (for example reason strings or embedded IDs) will create unbounded label cardinality and degrade Prometheus memory/query performance. Please normalize this to a small allowlist (e.g., HIT/MISS/BYPASS/ERROR) or map unknown values to a fixed bucket before incrementing the counter.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d8bb37afb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (raw && typeof raw === 'object') { | ||
| const obj = raw as Record<string, unknown>; | ||
| const status = typeof obj.status === 'string' ? obj.status : typeof obj.cacheStatus === 'string' ? obj.cacheStatus : null; | ||
| if (!status) return null; |
There was a problem hiding this comment.
Continue cache-status fallback instead of returning early
When raw is an object but does not contain a string status/cacheStatus, this early return null prevents the later structuredContent.cacheStatus fallback from running. In mixed-shape results (for example, cache: { keyVersion: "v2" } plus structuredContent.cacheStatus: "miss"), openchrome_cache_status_total is silently dropped even though a valid cache status is present.
Useful? React with 👍 / 👎.
Constraint: OpenChrome must expose token/output savings without adding tokenizer or billing dependencies. Rejected: Provider-specific tokenizers | they add dependency and billing semantics outside this MCP server boundary. Confidence: medium Scope-risk: narrow Directive: Keep metric labels bounded; never add URL, selector, instruction, or domain labels for these metrics. Tested: /Users/jh0927/openchrome/node_modules/.bin/tsc -p tsconfig.json --pretty false; npx jest --config jest.config.js --runInBand tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts Not-tested: Live /metrics smoke against a running OpenChrome instance
5d8bb37 to
94f4324
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 94f4324d1b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (typeof originalChars === 'number' && typeof compressedChars === 'number' && originalChars > compressedChars) { | ||
| metrics.observe( | ||
| 'openchrome_tool_compression_saved_bytes', | ||
| withTenantLabel({ tool: toolName, mode: level }), |
There was a problem hiding this comment.
Normalize compression mode labels before emitting metrics
recordToolOutputObservability forwards _compression.level directly into the mode label for openchrome_tool_compression_saved_bytes. Because MCPResult is free-form, any tool can return request-specific strings in _compression.level, which creates unbounded label cardinality and can degrade Prometheus memory/query performance once compression stats are emitted. Please map this label to a small allowlist (and bucket unknowns) before calling observe.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 94f4324d1b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ], | ||
| isError: true, | ||
| }; | ||
| this.recordToolOutputObservability(toolName, forbiddenResult); |
There was a problem hiding this comment.
Bound tool labels before recording denied-call output metrics
recordToolOutputObservability is called on the scope-denied path before verifying that toolName is a registered tool, so a read-only API key can send arbitrary params.name values and create unbounded new time series in openchrome_tool_output_bytes / openchrome_tool_estimated_tokens (and potentially cache-status metrics). This introduces a cardinality-amplification vector from user input and can degrade Prometheus memory/query performance; normalize unknown names to a fixed bucket (e.g. unknown) or only emit these metrics after successful tool lookup.
Useful? React with 👍 / 👎.
Merges the latest develop test-infra repairs so PR #1063 is evaluated against the branch it targets without changing the bounded metrics feature itself. Constraint: Remote CI failed on stale develop test expectations unrelated to token observability. Rejected: Editing unrelated doctor/tab tests in this PR | the fixes already landed on develop and should be consumed as base updates. Confidence: medium Scope-risk: narrow Directive: Keep output observability labels bounded; do not add URL, selector, instruction, or domain labels. Tested: pending rerun after dependency install; remote CI required for full matrix. Not-tested: Full CI matrix locally. Co-authored-by: OmX <omx@oh-my-codex.dev>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f5f9adab7f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const originalChars = (compression as Record<string, unknown>).originalChars; | ||
| const compressedChars = (compression as Record<string, unknown>).compressedChars; | ||
| const level = String((compression as Record<string, unknown>).level ?? 'unknown'); | ||
| if (typeof originalChars === 'number' && typeof compressedChars === 'number' && originalChars > compressedChars) { |
There was a problem hiding this comment.
Preserve compression stats before observing saved bytes
recordToolOutputObservability expects _compression.originalChars and _compression.compressedChars, but the success path currently rewrites _compression to { level, verbosity } before this method runs when compression.trackSavings is enabled. In that configuration, any tool-provided compression stats are discarded, so openchrome_tool_compression_saved_bytes never records even when a tool returns valid savings data. Please merge/retain existing _compression fields instead of replacing the object.
Useful? React with 👍 / 👎.
Records original and delta response sizes on real read_page DOM delta responses so compression savings metrics are populated from actual tool output instead of inert placeholder metadata. Constraint: PR #1063 promised compression savings observability without adding tokenizer or billing dependencies. Rejected: Provider-specific token accounting | this remains a bounded payload-size heuristic within the MCP server boundary. Confidence: high Scope-risk: narrow Directive: Keep output metric labels bounded and never include URL, selector, instruction, or domain labels. Tested: npm test -- --runInBand tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts; ./node_modules/.bin/tsc -p tsconfig.json --pretty false Not-tested: Live /metrics smoke against a running OpenChrome instance. Co-authored-by: OmX <omx@oh-my-codex.dev>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d45a2def9a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| originalChars: outputText.length, | ||
| compressedChars: compressedText.length, |
There was a problem hiding this comment.
Count full uncompressed response when reporting delta savings
_compression.originalChars is set from outputText.length, but the non-delta DOM response for the same call is outputText + domPaginationSection. When includePagination is left at its default (true) and pagination metadata is non-empty, this undercounts the baseline and can make originalChars <= compressedChars, so openchrome_tool_compression_saved_bytes is underreported or dropped even when delta compression actually saved bytes.
Useful? React with 👍 / 👎.
Cleans up module mocks used by the read_page compression-savings regression test so full-suite CI workers do not inherit mocked session or DOM modules. Constraint: CI runs many Jest files in shared workers, so test-local doMock calls must not leak beyond this suite. Rejected: Removing the real read_page delta-path test | it is needed to prove the compression-savings metric is populated. Confidence: high Scope-risk: narrow Directive: Keep observability tests hermetic when mocking runtime modules. Tested: npm test -- --runInBand tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts Not-tested: Full Jest suite locally after this test-isolation fix. Co-authored-by: OmX <omx@oh-my-codex.dev>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 53347f80b3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| metrics.observe( | ||
| 'openchrome_tool_compression_saved_bytes', | ||
| withTenantLabel({ tool: toolName, mode: level }), | ||
| originalChars - compressedChars, |
There was a problem hiding this comment.
Convert compression savings metric to true byte deltas
openchrome_tool_compression_saved_bytes is recorded from originalChars - compressedChars, but those _compression fields are character counts (for example read_page sets them from .length). For non-ASCII output (CJK, emoji, etc.), character deltas can differ substantially from UTF-8 byte deltas, so this metric reports incorrect units and can mislead dashboards/alerts that depend on byte savings. Compute the delta from UTF-8 byte lengths (or rename the metric to chars) before observing it.
Useful? React with 👍 / 👎.
Constraint: PR #1063 inherits the expanded Tier 1 tool surface from develop. Rejected: Loosening the cursor verification assertion | it would stop catching accidental tool-surface drift. Confidence: high Scope-risk: narrow Directive: Keep the exact count aligned whenever Tier 1 tool registration intentionally changes. Tested: CI=true npx jest --config jest.ci.config.js --runInBand tests/cross-env/cursor-verification.test.ts Not-tested: Full CI matrix pending on GitHub runners. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review |
Constraint: PR #1063 must pass Windows CI where Git can check out JSON fixtures with CRLF line endings. Rejected: Changing the v1.11.0 fixture payload | the response shape did not change. Confidence: high Scope-risk: narrow Directive: Preserve strict JSON shape assertions while normalizing platform line endings at comparison time. Tested: CI=true npx jest --config jest.ci.config.js --runInBand tests/tools/console-capture-regression.test.ts tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.ts Not-tested: Full CI matrix pending on GitHub runners. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review |
Merge current develop into the token-observability PR while preserving delta-compression metadata and the newer node_refs DOM output contract. The admin-key plaintext assertion remains resilient to unrelated worker console noise from CI. Constraint: PR #1063 was merge-conflicting with develop and CI failed on macOS admin-key stdout noise before this review pass. Rejected: Treating Jest console noise as CLI plaintext output | the CLI contract is about exactly one key token, not every captured stdout line. Confidence: high Scope-risk: moderate Directive: Preserve machine-readable compression metadata when adding additional DOM response sections. Tested: npm ci; npm run build; npm run lint; npm run lint:tier; npm run lint:tool-schemas; npx jest tests/cli/admin-keys.test.ts --runInBand --forceExit; git diff --check Not-tested: Full multi-OS GitHub Actions matrix before push. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review 최신 develop 병합 및 macOS admin-key 테스트 안정화 반영 완료:
검증:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed21ebaf73
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| content: [{ type: 'text', text: compressedText }], | ||
| _compression: { | ||
| level: 'delta', | ||
| originalChars: outputText.length, |
There was a problem hiding this comment.
Compute original delta size from full DOM response
In the delta path, _compression.originalChars is derived from outputText.length, but the fallback non-delta DOM response includes nodeRefsBlock and optional pagination text as well. On pages with large [node_refs] sections (or pagination metadata), this underestimates the baseline size and can hide real compression savings in observability metrics. Use the same full-response text shape used by the non-delta return when calculating the original size.
Useful? React with 👍 / 👎.
| metrics.observe( | ||
| 'openchrome_tool_compression_saved_bytes', | ||
| withTenantLabel({ tool: toolName, mode: level }), | ||
| originalChars - compressedChars, |
There was a problem hiding this comment.
Measure compression savings in bytes before reporting
This metric is named openchrome_tool_compression_saved_bytes, but it records originalChars - compressedChars from _compression. Since these values are currently based on string .length (UTF-16 code units), responses containing multi-byte UTF-8 characters will report incorrect byte savings. Convert both values to UTF-8 byte lengths (or rename the metric) so dashboards reflect actual payload reduction.
Useful? React with 👍 / 👎.
Merge the latest develop crawl-tool additions into PR #1063 and apply the same pipe-safe schema lint handling now required by the larger tool manifest. Constraint: develop advanced after the first push and the manifest now exceeds the old pipe-safe size envelope on this branch too. Rejected: Leaving the branch green only by current byte size | subsequent develop merges can cross the same threshold again. Confidence: high Scope-risk: moderate Directive: Introspection commands must flush stdout and schema lint must stream stdin for large manifests. Tested: npm run build; npm run lint:tool-schemas; npx jest tests/cli/admin-keys.test.ts --runInBand --forceExit; git diff --check Not-tested: Full multi-OS GitHub Actions matrix before push. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review develop 최신 변경(crawl_start/status/cancel)까지 추가 병합하고, 커진 manifest에서도 lint:tool-schemas가 안정적으로 동작하도록 pipe-safe introspection/schema-lint 수정을 이 PR에도 반영했습니다. 추가 검증:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 71054b8664
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| originalChars: outputText.length, | ||
| compressedChars: compressedText.length, |
There was a problem hiding this comment.
Compute delta compression savings from comparable payloads
The _compression metadata for read_page delta responses currently compares originalChars from outputText.length against compressedChars from compressedText.length, but compressedText includes nodeRefsBlock/pagination sections while originalChars does not. In calls where those shared sections are non-trivial, this undercounts (or can invert) savings, which then suppresses or distorts openchrome_tool_compression_saved_bytes in recordToolOutputObservability. This makes the new observability metrics inaccurate specifically for DOM delta mode responses with larger node-ref or pagination blocks.
Useful? React with 👍 / 👎.
Make the HTTP bearer auth regression allocate its port range from the Jest process id before using per-case ports, avoiding collisions across parallel worktrees and CI shards. Constraint: PR #1063 CI failed in the HTTP auth suite with socket hangups despite the targeted admin-key fix being green. Rejected: Ignoring unrelated full-suite transport flakes | merge readiness requires the matrix to stay green. Confidence: high Scope-risk: narrow Directive: Keep transport tests independent of globally fixed loopback ports. Tested: npx jest tests/transports/http-bearer-auth.test.ts --runInBand --forceExit Not-tested: Full multi-OS GitHub Actions matrix before push. Co-authored-by: OmX <omx@oh-my-codex.dev>
|
@codex review HTTP bearer auth 테스트 고정 포트 충돌을 process별 포트 range로 보강한 최신 커밋까지 리뷰 부탁드립니다. |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
@codex review\n\nAddressed current P1 observability feedback: output/token/compression/cache metrics now bucket unregistered tool names as |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ecc27f6abf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| metrics.observe('openchrome_tool_output_bytes', withTenantLabel({ tool: toolName }), bytes); | ||
| metrics.observe('openchrome_tool_estimated_tokens', withTenantLabel({ tool: toolName }), estimateOutputTokensFromChars(payload.length)); |
There was a problem hiding this comment.
Bucket unregistered tool labels before emitting metrics
recordToolOutputObservability emits tool labels using raw toolName, but this function is now called on tenant/scope-denied paths before tool lookup (src/mcp-server.ts around lines 1295-1342), where toolName comes directly from request input. That lets callers generate unbounded time-series cardinality by sending many distinct bogus tool names, which can exhaust metrics memory and scrape performance; unregistered names should be normalized to a bounded value (e.g., unknown) before emitting openchrome_tool_output_bytes, openchrome_tool_estimated_tokens, openchrome_tool_compression_saved_bytes, and openchrome_cache_status_total.
Useful? React with 👍 / 👎.
Summary
Closes #996.
Alignment / scope review
This stays aligned with OpenChrome's direction: no server-side LLM, no tokenizer dependency, no billing integration, and no persisted response bodies. It measures only payload sizes/status after response shaping and redaction.
Verification
/Users/jh0927/openchrome/node_modules/.bin/tsc -p tsconfig.json --pretty falsenpx jest --config jest.config.js --runInBand tests/metrics/tool-output-observability.test.ts tests/metrics/cookie-scan-metrics.test.tsPost-merge smoke
Run the issue's OpenChrome MCP verification script/manual flow against
/metricsto confirmopenchrome_tool_output_bytes,openchrome_tool_estimated_tokens, and cache-status counters appear with bounded labels.